Reale Mutua

In preparation for the technical interview, the candidate is asked to complete a Data Science exercise within 48 hours. The idea is to simulate the final presentation that he/she would give to a hypothetical internal stakeholder during the interview. The result of the exercise is thus used as the basis for the presentation and for discussing the choices made from a technical point of view. For the sake of simplicity, it is highly recommended to perform the analysis in a Jupyter Notebook and to send the result as an HTML file. The instructions for the exercise follow; the dataset has to be loaded from the file "dataset_reale.csv".

Introduction: the Data Science Center of Excellence has been asked by the business personnel to improve the activity of the Welfare office. In particular, the solution needs to explore the possibility of using a digital app to support customers who smoke against the onset of cardio-respiratory diseases. Drawing on its historical database, the Welfare office has produced a dataset to be provided for analysis.

Definition of the starting dataset:

First goal: An exploratory analysis of the dataset proposed by the candidate that enables the Welfare office to better understand the data it provided.

Second goal: An algorithm that predicts the probability of contracting a cardio-respiratory disease on the basis of the data observed. Qualitative parameters should be provided as well in order to validate the chosen method.

Info: The explanations will be valued very positively.

Exploratory Data Analysis (EDA)

Insights from EDA:

Modeling

Split train and test set
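
As a rough illustration of this step, here is a minimal sketch of a stratified split. The data is a synthetic stand-in for "dataset_reale.csv", and the label name `target`, the 25% test size, and the random seed are all assumptions; the notebook's actual values are not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for dataset_reale.csv; "target" is a hypothetical label name.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "column1": rng.normal(size=200),
    "column4": rng.normal(size=200),
    "column6": rng.normal(size=200),
    "target": np.r_[np.ones(40, dtype=int), np.zeros(160, dtype=int)],
})

X = df.drop(columns="target")
y = df["target"]

# stratify=y keeps the share of positive cases the same in both partitions,
# which matters when the disease class is rare.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```

Splitting before any resampling keeps the test set untouched, so the later evaluation reflects the original class balance.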

Oversampling the data

There are very few NaN values, so I decided to drop all rows containing NaN values from the dataset.
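
A minimal sketch of these two steps on a toy training frame. The oversampling method used in the notebook is not stated here, so plain random oversampling of the minority class is assumed as a stand-in; the label name `target` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy training data (hypothetical values); one row has a missing value.
train = pd.DataFrame({
    "column1": [0.1, 0.4, np.nan, 0.9, 1.2, 0.3, 0.8, 0.5],
    "column4": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "target":  [0, 0, 0, 1, 0, 0, 1, 0],
})

# Drop every row that contains at least one NaN.
train = train.dropna()

majority = train[train["target"] == 0]
minority = train[train["target"] == 1]

# Random oversampling: draw minority rows with replacement
# until both classes have the same number of rows.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Concatenate and shuffle so the classes are not grouped together.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
```

Only the training partition should be resampled; oversampling the test set would distort the evaluation metrics.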

Train

Calibration
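
The sketch below shows one way training and calibration can be combined. The model and calibration method used in the notebook are not stated here, so a random forest with sigmoid (Platt) calibration is assumed purely for illustration, on synthetic data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=600, n_features=6, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed base model; any classifier exposing predict_proba would work.
base = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validated calibration: in each fold the base model is fitted on one
# part of the training data and the calibrator on the held-out part.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# Calibrated probability of the positive class for each test sample.
proba = calibrated.predict_proba(X_test)[:, 1]
```

Calibration matters here because the goal is a probability of contracting the disease, not just a class label.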

Evaluation
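
A small sketch of metrics that could support this step: precision and recall for the predicted classes, and the Brier score for the quality of the predicted probabilities. The labels, probabilities, and the 0.5 threshold are made-up illustrative values, not results from the notebook.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, precision_score, recall_score

# Hypothetical true labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.6, 0.8, 0.7, 0.2, 0.4, 0.3, 0.1])

# Turn probabilities into class predictions with an assumed 0.5 threshold.
y_pred = (y_prob >= 0.5).astype(int)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# Mean squared difference between predicted probability and outcome;
# lower values indicate better probability estimates.
brier = brier_score_loss(y_true, y_prob)

print(f"precision={precision:.3f} recall={recall:.3f} brier={brier:.3f}")
```

In this toy example there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3.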

Feature importance and explainability
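
One model-agnostic way to rank features is permutation importance, sketched below. The explainability technique used in the notebook is not stated here, so this is only an assumed alternative; the data is synthetic, and the `noise` column and the rule generating the label are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on column1 and column4 but not on "noise".
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(500, 3)), columns=["column1", "column4", "noise"]
)
y = (X["column1"] + 0.5 * X["column4"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the drop in score.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = pd.Series(result.importances_mean, index=X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

Computing the importances on the test set rather than the training set avoids rewarding features the model has merely memorized.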

Conclusion

  1. The model performs very well for both precision and recall.
  2. The model provides a reliable probability of contracting a cardio-respiratory disease.
  3. The model could help people prevent cardio-respiratory disease through continuous monitoring of their generic physiological values.
  4. The model shows that at least three physiological parameters strongly affect the probability of contracting the disease: column6, column4, and column1. People should monitor these values carefully.
  5. Health is a very sensitive field, so it is very important to review the model's results and the features column6, column4, and column1 with expert physicians.